Random Forests with Missing Values in the Covariates
Authors
Abstract
In Random Forests [2], several trees are constructed from bootstrap samples or subsamples of the original data. Random Forests have become very popular, e.g., in genetics and bioinformatics, because they can handle high-dimensional problems that include complex interaction effects. Conditional Inference Forests [8] provide an implementation of Random Forests with unbiased variable selection. Like the original Random Forests, they employ surrogate variables to handle missing values in the predictor variables. In this paper we report the results of an extensive simulation study covering both classification and regression problems under a variety of scenarios, including different missing-value generating processes and different correlation structures between the variables. In addition, a high-dimensional setting with a large number of noise variables was considered in each case. The study compares the performance of Conditional Inference Forests with surrogate variables to that of Conditional Inference Forests fitted after k-nearest neighbor (kNN) imputation. The results show that, while one or the other approach is slightly superior in some settings, there is no overall difference in performance between the two.
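To make the "impute first, then fit" strategy concrete, the following is a minimal sketch of kNN imputation in plain Python. It is an illustration only and not the procedure used in the study: the abstract does not specify the kNN variant, so the distance (root mean squared difference over the coordinates observed in both rows), the mean of the donors as the imputed value, and the names `knn_impute`, `rows`, and `k` are all assumptions made here.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of `rows` in which each None entry is replaced by the
    mean of that feature over the k nearest rows that observe it.
    (Hypothetical helper for illustration; not the study's implementation.)"""

    def distance(a, b):
        # Use only coordinates observed in both rows, and average the squared
        # differences so rows with few shared coordinates are not favoured.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which feature j is observed
            # and which share at least one observed coordinate with this row.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


# Toy data: the second row is missing its second feature.
data = [
    [1.0, 2.0, 10.0],
    [1.1, None, 11.0],
    [0.9, 2.2, 9.0],
    [8.0, 9.0, 50.0],
]
filled = knn_impute(data, k=2)
```

In this toy example the two nearest donors for the incomplete row are the first and third rows, so the gap is filled with the mean of their values for that feature. A forest would then be fitted to `filled`, whereas the surrogate-variable approach leaves the data as they are and handles missing values inside each split.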
Similar resources
Imputation of Missing Values for Unsupervised Data Using the Proximity in Random Forests
This paper presents a new procedure that uses random forests to impute missing values in unsupervised data. We found that it works well compared with k-nearest neighbor (kNN) imputation and with rough imputation, which replaces missing values with the median of each variable. Moreover, this procedure can be extended to semisupervised data sets. The rate of the correct classification is higher than that of other conventional method...
Seven Techniques for Data Dimensionality Reduction: Missing Values, Low Variance Filter, High Correlation Filter, PCA, Random Forests, Backward Feature Elimination, and Forward Feature Construction
Comparison of Bayesian and Classical Methods for Estimating the Parameters of the Logistic Regression Model in the Presence of Missing Values in the Covariates
Background and Aim: Logistic regression is an analytic tool widely used in medical and epidemiologic research. In many studies, we face data sets in which some of the data are not recorded. A simple way to deal with such "missing data" is to simply ignore the subjects with missing observations, and perform the analysis on cases for which complete data are available. Materials and Methods: We c...
Evaluation of Imputation of Covariates in an Impact Analysis With Regression Adjustment
In an impact analysis using random assignment, researchers often deal with missing values in both the covariates and the outcome variables of regression models. Clearly rigorous methods are needed to impute missing values in the outcome variables to minimize the potential bias in impact assessments. When imputation is applied to covariates of the regression analyses, the effect of imputation is...
A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data
BACKGROUND: Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points, and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (...